Finding local community structure in networks 



Aaron Clauset 
Department of Computer Science, 
University of New Mexico, Albuquerque NM 87131 
aaron@cs.unm.edu
(Dated: February 2, 2008) 

Although the inference of global community structure in networks has recently become a topic of 
great interest in the physics community, all such algorithms require that the graph be completely 
known. Here, we define both a measure of local community structure and an algorithm that infers 
the hierarchy of communities that enclose a given vertex by exploring the graph one vertex at a time. 
This algorithm runs in time $O(k^2 d)$ for general graphs, where d is the mean degree and k is the number
of vertices to be explored. For graphs where exploring a new vertex is time-consuming, the running 
time is linear, O(k). We show that on computer-generated graphs this technique compares favorably 
to algorithms that require global knowledge. We also use this algorithm to extract meaningful local 
clustering information in the large recommender network of an online retailer and show the existence 
of mesoscopic structure. 



I. INTRODUCTION 

Recently, physicists have become increasingly interested in
representing the patterns of interactions in complex systems
as networks. Canonical examples include the Internet, the
World Wide Web, social networks, citation networks and
biological networks. In each case, the system is modeled as a
graph with n vertices and m edges, where the edges represent,
e.g., physical connections between computers, friendships
between people and citations among academic papers.

Within these networks, the global organization of ver- 
tices into communities has garnered broad interest both 
inside and beyond the physics community. Convention- 
ally, a community is taken to be a group of vertices in 
which there are more edges between vertices within the 
group than to vertices outside of it. Although the par- 
titioning of a network into such groups is a well-studied 
problem, older algorithms tend to work well only in special
cases. Several algorithms have recently been proposed within
the physics community, and have been shown to reliably
extract known community structure in real-world networks.
Similarly, the computer science community has proposed
algorithms based on the concept of flow [22].

However, each of these algorithms requires knowledge of
the entire structure of the graph. This constraint is
problematic for networks like the World Wide Web, which for
all practical purposes is too large and too dynamic to ever
be known fully, or for networks that are larger than can be
accommodated by the fastest algorithms [21]. In spite of
these limitations, we would still like to make quantita- 
tive statements about community structure, albeit con- 
fined to some accessible and known region of the graph 
in question. For instance, we might like to quantify the 
local communities of either a person given their social 
network, or a particular website given its local topology 
in the World Wide Web. 

Here, we propose a general measure of local community
structure, which we call local modularity, for graphs in
which we lack global knowledge and which must be explored
one vertex at a time. We then define a fast agglomerative
algorithm that maximizes the local modularity in a greedy
fashion, and test the algorithm's performance on a series of
computer-generated networks with known community structure.
Finally, we use this algorithm to analyze the local community
structure of the online retailer Amazon.com's recommender
network, which is composed of more than 400 000 vertices and
2 million edges. Through this analysis, we demonstrate the
existence of mesoscopic network structure that is distinct
from both the microstructure of vertex statistics and the
global community structure previously given in [21].
Interestingly, we find a wide variety of local community
structures, and that generally, the local modularity of the
network surrounding a vertex is negatively correlated with
its degree.



II. LOCAL MODULARITY 

The inference of community structure can generally be 
reduced to identifying a partitioning of the graph that 
maximizes some quantitative notion of community struc- 
ture. However, when we lack global knowledge of the 
graph's topology, a measure of community structure must 
necessarily be independent of those global properties. For 
instance, this requirement precludes the use of the
modularity metric Q, due to Newman and Girvan [17], as it
depends on m.

Suppose that in the graph G, we have perfect knowledge of
the connectivity of some set of vertices, i.e., the known
portion of the graph, which we denote C. This necessarily
implies the existence of a set of vertices U about which we
know only their adjacencies to C. Further, let us assume that
the only way we may gain additional knowledge about G is by
visiting some neighboring vertex $v_i \in U$, which yields a
list of its adjacencies. As a result, $v_i$ becomes a member
of C, and additional unknown vertices may be added to U. This
vertex-at-a-time discovery process is directly analogous to
the manner in which "spider" or "crawler" programs harvest
the hyperlink structure of the World Wide Web.

The adjacency matrix of such a partially explored 
graph is given by 



$$
A_{ij} = \begin{cases}
1 & \text{if vertices } i \text{ and } j \text{ are connected, and either vertex is in } C, \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
$$



If we consider C to constitute a local community, the
simplest measure of the quality of such a partitioning of G
is the fraction of known adjacencies that are completely
internal to C. This quantity is given by

$$
\frac{\sum_{ij} A_{ij}\,\delta(i,j)}{\sum_{ij} A_{ij}}
  = \frac{1}{2m^*} \sum_{ij} A_{ij}\,\delta(i,j),
\tag{2}
$$

where $m^* = \frac{1}{2}\sum_{ij} A_{ij}$, the number of
edges in the partial adjacency matrix, and $\delta(i,j)$ is 1
if both $v_i$ and $v_j$ are in C and 0 otherwise. This
quantity will be large when C has many internal connections
and few connections to the unknown portion of the graph. This
measure also has the property that when $|C| \gg |U|$, the
partition will almost always appear to be good.

If we restrict our consideration to those vertices in the
subset of C that have at least one neighbor in U, i.e., the
vertices which make up the boundary of C (Fig. 1), we obtain
a direct measure of the sharpness of that boundary.
Additionally, this measure is independent of the size of the
enclosed community. Intuitively, we expect that a community
with a sharp boundary will have few connections from its
boundary to the unknown portion of the graph, while having a
greater proportion of connections from the boundary back into
the local community. In the interest of keeping the notation
concise, let us denote those vertices that comprise the
boundary as B, and the boundary-adjacency matrix as

$$
B_{ij} = \begin{cases}
1 & \text{if vertices } i \text{ and } j \text{ are connected, and either vertex is in } B, \\
0 & \text{otherwise.}
\end{cases}
\tag{3}
$$

Thus, we define the local modularity R to be

$$
R = \frac{\sum_{ij} B_{ij}\,\delta(i,j)}{\sum_{ij} B_{ij}} = \frac{I}{T},
\tag{4}
$$

where $\delta(i,j)$ is 1 when either $v_i \in B$ and
$v_j \in C$ or vice versa, and is 0 otherwise. Here, T is the
number of edges with one or more endpoints in B, while I is
the number of those edges with neither endpoint in U. This
measure assumes an unweighted graph, although the weighted
generalization is straightforward [23].
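
As a minimal sketch, the ratio I/T can be computed directly from an adjacency structure. The function name `local_modularity` and the dictionary-of-sets representation `adj` are assumptions made for illustration and do not appear in the text; the sketch also assumes an undirected, unweighted graph and returns `None` where R is undefined.

```python
def local_modularity(adj, community):
    """Return R = I / T for the vertex set `community`, or None if undefined.

    adj: dict mapping each vertex to the set of its neighbors (undirected).
    This representation is an assumption of the sketch, not of the paper.
    """
    C = set(community)
    # Boundary B: members of C with at least one neighbor outside C.
    B = {v for v in C if any(u not in C for u in adj[v])}
    # Edges with one or more endpoints in B (stored once per undirected edge).
    boundary_edges = {frozenset((u, v)) for v in B for u in adj[v]}
    T = len(boundary_edges)
    if T == 0:
        return None  # the whole component is in C, so R is undefined
    # I: boundary edges with both endpoints in C, i.e. neither endpoint in U.
    I = sum(1 for e in boundary_edges if e <= C)
    return I / T


# Hypothetical example: two triangles joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

print(local_modularity(adj, {0, 1, 2}))
```

For the community {0, 1, 2} the boundary is the single vertex 2, which has three incident edges, two of them internal, so the sketch gives R = 2/3.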

A few comments regarding this formulation are worthwhile
before proceeding. By considering the fraction of boundary
edges which are internal to C, we ensure that our measure of
local modularity lies on the interval 0 < R < 1, where its
value is directly proportional to the sharpness of the
boundary given by B. This is true except when the entire
component has been discovered, at which point R is undefined.
If we like, we may set R = 1 in that case in order to match
the intuitive notion that an entire component constitutes the
strongest kind of community. Finally, there are certainly
alternative measures that can be defined on B; however, in
this paper we consider only the one given.

FIG. 1: An illustration of the division of an abstract graph
into the local community C, its boundary B and the edges
which connect B to the largely unknown neighbors U.



III. THE ALGORITHM 

For graphs like the World Wide Web, in which one must
literally crawl the network in order to discover the
adjacency matrix, any analysis of local community structure
must necessarily begin at some source vertex $v_0$. In
general, if the explored portion of the graph has k vertices,
the number of ways to partition it into two sets, those
vertices considered a part of the same local community as the
source vertex and those considered outside of it, is given by
$2^{k-2} - 1$, which is exponential in the size of the
explored portion of the network. In this section, we describe
an algorithm that takes time only polynomial in k, and that
infers local community structure by using the
vertex-at-a-time discovery process subject to maximizing our
measure of local modularity.

Initially, we place the source vertex in the community,
$C = \{v_0\}$, and place its neighbors in U. At each step,
the algorithm adds to C (and to B, if necessary) the
neighboring vertex that results in the largest increase (or
smallest decrease) in R, breaking ties randomly. Finally, we
add to U any newly discovered vertices, and update our
estimate of R. This process continues until either it has
agglomerated a given number of vertices k, or it has
discovered the entire enclosing component, whichever happens
first. Pseudocode for this process is given in Algorithm 1.
As we will see in the two subsequent sections, this algorithm
performs well on both computer-generated graphs with some
known community structure and on real-world graphs.

Algorithm 1: The general algorithm for the greedy
maximization of local modularity, as given in the text.

    add $v_0$ to C
    add all neighbors of $v_0$ to U
    set B = {$v_0$}
    while |C| < k do
        for each $v_j \in U$ do
            compute $\Delta R_j$
        end for
        find $v_j$ such that $\Delta R_j$ is maximum
        add that $v_j$ to C
        add all new neighbors of that $v_j$ to U
        update R and B
    end while
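
The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used for the results reported here: the names `greedy_local_community` and `_local_modularity` and the dictionary-of-sets graph `adj` are hypothetical, each candidate is scored by recomputing R from scratch rather than by an incremental update (which selects the same vertex, since the current R is common to all candidates, but is slower), and an undefined R is scored as 1 following the convention suggested in Section II.

```python
import random


def _local_modularity(adj, C):
    """R = I / T for the set C, or None when the boundary has no edges."""
    B = {v for v in C if any(u not in C for u in adj[v])}
    edges = {frozenset((u, v)) for v in B for u in adj[v]}
    if not edges:
        return None
    return sum(1 for e in edges if e <= C) / len(edges)


def greedy_local_community(adj, source, k, rng=None):
    """Agglomerate up to k vertices around `source`.

    Returns a list of (vertex, R after adding that vertex) pairs, in order
    of agglomeration. Vertices are assumed to be sortable (e.g. integers).
    """
    rng = rng or random.Random(0)
    C = {source}
    U = set(adj[source])
    history = [(source, _local_modularity(adj, C))]
    while len(C) < k and U:
        # Score each candidate by the R that would result from adding it.
        scored = []
        for v in U:
            r = _local_modularity(adj, C | {v})
            scored.append((1.0 if r is None else r, v))
        best = max(score for score, _ in scored)
        # Break ties randomly among the best-scoring candidates.
        choice = rng.choice(sorted(v for score, v in scored if score == best))
        C.add(choice)
        U.discard(choice)
        U |= adj[choice] - C  # newly discovered vertices
        history.append((choice, _local_modularity(adj, C)))
    return history


# Hypothetical example: two triangles joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

history = greedy_local_community(adj, source=0, k=6)
for t, (v, r) in enumerate(history, start=1):
    print(t, v, r)
```

On this toy graph the sketch agglomerates the source's own triangle first, reaching R = 2/3 at the third step, and R drops once the bridge to the second triangle is crossed.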

The computation of the $\Delta R_j$ associated with each
$v_j \in U$ can be done quickly using an expression derived
from equation (4):

$$
\Delta R_j = \frac{x - R y - z(1 - R)}{T - z + y},
\tag{5}
$$

where x is the number of edges in T that terminate at $v_j$,
y is the number of edges that will be added to T by the
agglomeration of $v_j$ (i.e., the degree of $v_j$ is x + y),
and z is the number of edges that will be removed from T by
the agglomeration. Because $\Delta R_j$ depends on the
current value of R, and on the y and z that correspond to
$v_j$, each step of the algorithm takes time proportional to
the number of vertices in U. This is roughly kd, where d is
the mean degree of the graph; we note that this will be a
significant overestimate for graphs with non-trivial
clustering coefficients, significant community structure, or
when C is a large portion of the graph. Thus, in general, the
running time for the algorithm is $O(k^2 d)$, or simply
$O(k^2)$ for a sparse graph, i.e., when $m \sim n$. As it
agglomerates vertices, the algorithm outputs a function R(t),
the local modularity of the community centered on $v_0$ after
t steps, and a list of vertices paired with the time of their
agglomeration.
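
One way to see where the expression for $\Delta R_j$ comes from is the short derivation below. It is a sketch that rests on two assumptions not spelled out in the text: the x edges terminating at $v_j$ become internal once $v_j$ joins C, and the z edges removed from T were internal edges of vertices that leave the boundary. With $R = I/T$ before the step:

```latex
\begin{align*}
\Delta R_j
  &= \frac{I + x - z}{T - z + y} - R \\
  &= \frac{(I + x - z) - R\,(T - z + y)}{T - z + y} \\
  &= \frac{RT + x - z - RT + Rz - Ry}{T - z + y}
     \qquad \text{(using } I = RT\text{)} \\
  &= \frac{x - R y - z(1 - R)}{T - z + y}.
\end{align*}
```

Under these assumptions only x, y and z need to be counted for each candidate, rather than recomputing I and T from scratch.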

The above calculation of the running time is somewhat
misleading, as it assumes that the algorithm is dominated by
the time required to calculate the $\Delta R_j$ for each
vertex in U; however, for graphs like the World Wide Web,
where adding a new vertex to U requires the algorithm to
fetch a web page from a remote server, the running time will
instead be dominated by the time-consuming retrieval. When
this is true, the running time is linear in the size of the
explored subgraph, O(k).

A few comments regarding this algorithm are due. Because of
the greedy maximization of local modularity, a neighboring
high degree vertex will not be agglomerated until the number
of its unknown neighbors has decreased sufficiently. It is
this behavior that allows the algorithm to avoid crossing a
community boundary until absolutely necessary. Additionally,
the algorithm is somewhat sensitive to the degree
distribution of the source vertex's neighbors: when the
source degree is high, the algorithm will first explore its
low degree neighbors. This implicitly assumes that high
degree vertices are likely to sit at the boundary of several
local communities. While certainly not the case in general,
this may be true for some real-world networks. We shall
return to this idea in a later section.

Finally, although one could certainly stop the algorithm once
the first enclosing community has been found, in principle,
there is no reason that it cannot continue until some
arbitrary number of vertices have been agglomerated. Doing so
yields the hierarchy of communities which enclose the source
vertex. In a sense, this process is akin to the following:
given the dendrogram of the global community hierarchy, walk
upward toward the root from some leaf $v_0$ and observe the
successive hierarchical relationships as represented by
junctions in the dendrogram. In that sense, the enclosing
communities inferred by our algorithm for some source vertex
are the community hierarchy from the perspective of that
vertex.



IV. COMPUTER-GENERATED GRAPHS 

As has become standard with testing community inference
techniques, we apply our algorithm to a set of
computer-generated random graphs which have known community
structure. In these graphs, n = 128 vertices are divided into
four equal-sized communities of 32 vertices. Each vertex has
a total expected degree z which is divided between intra- and
inter-community edges such that $z = z_{in} + z_{out}$. These
edges are placed independently and at random so that, in
expectation, the values of $z_{in}$ and $z_{out}$ are
respected. By holding the expected degree constant at z = 16,
we may tune the sharpness of the community boundaries by
varying $z_{out}$. Note that for these graphs, when
$z_{out} = 12$, edges between vertices in the same group are
just as likely as edges between vertices that are not.
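
A graph of this kind can be generated with a short sketch such as the one below. The function name `planted_partition` and its parameters are hypothetical, and the edge probabilities are one natural reading of "placed independently and at random": each of the 31 possible intra-community pairs of a vertex is linked with probability $z_{in}/31$ and each of the 96 possible inter-community pairs with probability $z_{out}/96$, so that the expected degrees are respected.

```python
import random


def planted_partition(z_out, z=16.0, groups=4, size=32, seed=0):
    """Random graph with `groups` equal communities of `size` vertices.

    Returns a dict mapping each vertex to the set of its neighbors.
    Vertex v belongs to community v // size.
    """
    rng = random.Random(seed)
    n = groups * size
    z_in = z - z_out
    p_in = z_in / (size - 1)    # 31 possible partners inside the community
    p_out = z_out / (n - size)  # 96 possible partners outside it
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            same = (u // size) == (v // size)
            if rng.random() < (p_in if same else p_out):
                adj[u].add(v)
                adj[v].add(u)
    return adj


graph = planted_partition(z_out=2.0)
mean_degree = sum(len(nbrs) for nbrs in graph.values()) / len(graph)
print(len(graph), round(mean_degree, 2))
```

With $z_{out} = 12$ the two probabilities are 4/31 and 12/96, which are nearly equal, in line with the remark that intra- and inter-community edges are then equally likely.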

Figure 2 shows the average local modularity R as a function
of the number of steps t, over 500 realizations of the graphs
described above. For the sake of clarity, only data series
for $z_{out} < 6.0$ are shown and error bars are omitted.
Sharp community boundaries correspond to peaks in the curve.
As $z_{out}$ grows, the sharpness of the boundaries and the
height of the peaks decrease proportionally. When the first
derivative is positive everywhere, e.g., for $z_{out} > 5$,
the inferred locations of the community boundaries may be
extracted by finding local minima in the second derivative,
possibly after some smoothing.
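
The second-derivative criterion can be illustrated with a small sketch. The helper names `smooth` and `second_derivative_minima`, the moving-average smoother and the central-difference estimate are all assumptions chosen for illustration; the text does not specify how the derivative or the smoothing was computed.

```python
def smooth(values, window):
    """Centered moving average; the window is truncated at the ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def second_derivative_minima(R, window=1):
    """Indices t at which the discrete second derivative of R(t) has a
    strict local minimum. R is a list of local modularity values."""
    r = smooth(R, window) if window > 1 else list(R)
    # Central second difference, defined for interior indices only.
    d2 = {t: r[t + 1] - 2 * r[t] + r[t - 1] for t in range(1, len(r) - 1)}
    return [t for t in range(2, len(r) - 2)
            if d2[t] < d2[t - 1] and d2[t] < d2[t + 1]]


# Hypothetical monotone curve whose slope drops sharply after index 4.
curve = [x / 64 for x in (0, 10, 20, 30, 40, 42, 44, 46, 48)]
print(second_derivative_minima(curve))
```

For this made-up curve the only candidate boundary is the kink at index 4. On noisy data a window larger than 1 would likely be needed to suppress spurious minima.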

From this information we may grade the performance of the
algorithm on the computer-generated graphs. Figure 3 shows
the average fraction of correctly classified vertices for
each of the four communities as a function of $z_{out}$, over
500 realizations; error bars depict one standard deviation.
As a method for inferring the first enclosing community, our
algorithm classifies more than 50% of the vertices correctly
even when the boundaries are weak, i.e., when $z_{out} = 8$.
Although the variance in the quality of classification grows
as $z_{out}$ approaches $z_{in}$, this is to be expected
given that the algorithm uses only local information for its
inference, and large local fluctuations may mislead the
algorithm. For computer-generated graphs such as these, the
performance of our algorithm compares favorably to that of
more global methods.

FIG. 2: Local modularity R as a function of the number of
steps t, averaged over 500 computer-generated networks as
described in the text; error bars are omitted for clarity. By
varying the expected number of inter-community edges per
node $z_{out}$, the strength of the community boundaries is
varied.

FIG. 3: Fraction of correctly classified nodes, by community,
as a function of the number of inter-community edges
$z_{out}$. Although the variance increases as the community
boundaries become less sharp, the average behavior (over 500
realizations) degrades gracefully, and compares favorably
with methods which use global information.

Recently, another approach to inferring community structure
using only local information appeared [24]. This alternative
technique relies upon growing a breadth-first tree outward
from the source vertex $v_0$, until the rate of expansion
falls below an arbitrary threshold. The uniform exploration
has the property that some level in the tree will correspond
to a good partitioning only when $v_0$ is equidistant from
all parts of its enclosing community's boundary. On the other
hand, by exploring the surrounding graph one vertex at a
time, our algorithm will avoid crossing boundaries until it
has explored the remainder of the enclosing community.



V. LOCAL CO-PURCHASING HABITS 

In this section, we apply our local inference algorithm to
the recommender network of Amazon.com, collected in August
2003, which has n = 409 687 vertices, m = 2 464 630 edges
and thus a mean degree of 12.03. We note that the degree
distribution is fairly right-skewed, having a standard
deviation of 14.64. Here, vertices are items such as books
and digital media sold on Amazon's website, while edges
connect pairs of items that are frequently purchased together
by customers. It is this co-purchasing data that yields
recommendations for customers as they browse the online
store. Although in general, the algorithm we have described
is intended for graphs like the World Wide Web, the Amazon
recommender network has the advantage that, by virtue of
being both very large and fully known, we may explore global
regularities in local community structure without concern for
sampling bias in the choice of source vertices. Additionally,
we may check the inferred community structures against our,
admittedly heuristic, notions of correctness.

As illustrative examples, we choose three qualitatively
different items as source vertices: the compact disc Alegria
by Cirque du Soleil, the book Small Worlds by Duncan Watts,
and the book Harry Potter and the Order of the Phoenix by
J.K. Rowling. These items have degree 15, 19 and 3117
respectively. At the time the network data was collected, the
Harry Potter book was the highest degree vertex in the
network, its release date having been June 2003. For each of
these items, we explore k = 25 000 associated vertices.
Figure 4 illustrates the local modularity as a function of
the number of steps t for each item; an analogous data series
for a random graph with the same degree distribution [25] has
been plotted for comparison. We mark the locations of the
five principal enclosing communities with large open symbols.

These time series have several distinguishing features.
First, Alegria has the smallest enclosing communities,
composed of t = {10, 30, 39, 58, 78} vertices, and these
communities are associated with high values of local
modularity. The first five enclosing communities all have
R > 0.62, while the third community corresponds to R = 0.81,
indicating that only about 20% of boundary edges reach out to
the rest of the network. In contrast, the communities of
Small Worlds contain t = {36, 48, 69, 82, 94} vertices, while
the Harry Potter book's communities are extremely large,
containing t = {607, 883, 1270, 1374, 1438} vertices. Both
sets have only moderate values of local modularity, R < 0.43.
It is notable that the local modularity functions for all
three items follow relatively distinct trajectories until the
algorithm has agglomerated roughly 10 000 items. Beyond that
point, the curves begin to converge, indicating that, from
the perspectives of the source vertices, the local community
structure has become relatively similar.

FIG. 4: Local modularity R for three items in the Amazon.com
recommender network, shown on log-linear axes. For
comparison, the time series for a random graph with the same
degree distribution is shown. The large open symbols indicate
the locations of the five strongest enclosing communities.

FIG. 5: The first three enclosing communities for Cirque du
Soleil's Alegria in Amazon.com's recommender network;
communities are distinguished by shape (circles, diamonds,
squares respectively). Connections to triangles represent
connections to items in the remaining unknown portion of the
graph. Alegria and Order of the Phoenix are denoted by 1 and
4 respectively.

To illustrate the inferred local structure, we show the 
partial subgraph that corresponds to the first three en- 
closing local communities for the compact disc Alegria in 
Figure 5. Here, communities are distinguished by shape
according to the order of discovery (circle, diamond and 
square respectively), and vertices beyond these commu- 
nities are denoted by triangles. Items in the first enclos- 
ing community are uniformly compact discs produced by 
Cirque du Soleil. Items in the second are slightly more di- 
verse, including movies and books about the troupe, the 
Cirque du Soleil compact disc entitled Varekai, and one 
compact disc by a band called Era; the third group con- 
tains both new and old Cirque du Soleil movies. Varekai 
appears to have been placed outside the first commu- 
nity because it has fewer connections to those items 
than to items in the subsequent enclosing communities. 
Briefly, we find that the enclosing local communities of 
Small Worlds are populated by texts in sociology and 
social network analysis, while the Harry Potter book's 
communities have little topical similarity. 

In Figure 5, the labels 1 and 4 denote the items Alegria and
Order of the Phoenix, respectively. It is notable that these
items are only three steps away in the graph, yet have
extremely different local community structures (Fig. 4). If
an item's popularity is reflected by its degree, then it is
reasonable to believe that the strength of the source
vertex's local community structure may be inversely related
to its degree. That is, popular items like Order of the
Phoenix may tend to link many well-defined communities by
virtue of being purchased by a large number of customers with
diverse interests, while niche items like Cirque du Soleil's
Alegria exhibit stronger local community structure as the
result of more specific co-purchasing habits. Such structure
appears to be distinct from both the macroscopic structure
discovered using global community inference methods [21], and
the microscopic structure of simple vertex-statistics such as
clustering or assortativity.

With the exception of social networks, the degrees of
adjacent vertices appear to be negatively correlated in most
networks. This property is often called "disassortative"
mixing, and can be caused by a high clustering coefficient,
global community structure or a specific social mechanism.
However, for the Amazon recommender network, we find that the
assortativity coefficient is not statistically different from
zero, $r = -3.01 \times 10^{-19} \pm 1.49 \times 10^{-4}$,
yet the network exhibits a non-trivial clustering
coefficient, c = 0.17, and strong global community structure
with a peak modularity of Q = 0.745 [21]. Returning to the
suggestion above that there is an inverse relationship
between the degree of the source vertex and the strength of
its surrounding community structure, we sample for 100 000
random vertices the average local modularity over the first
k = 250 steps. We find the average local modularity to be
relatively high, $R_{amzn} = 0.49 \pm 0.08$, while a random
graph with the same degree distribution yields
$R_{rand} = 0.16 \pm 0.01$. The variance for the Amazon graph
is due to the contributions of high degree vertices. In
Figure 6, we plot, from our random sample, the average local
modularity for all source vertices with degree at least d.
Notably, the average is relatively constant until d = 13,
after which it falls off logarithmically. This supports the
hypothesis that, in the recommender network, there is a weak
inverse relationship between the degree of the source vertex
and the strength of its surrounding local community.

FIG. 6: The average local modularity over the first 250 steps
for source vertices with degree at least d. The "knee" in the
upper data series is located at d = 13; the mean degree for
the network is 12.03. The logarithmic falloff illustrates the
negative correlation between source vertex degree and the
strength of the surrounding local community.

Naturally, there are many ways to use the concept of 
local community structure to understand the mesoscopic 
properties of real-world networks. Further characterizations
of the Amazon graph are beyond the scope of
this paper, but we propose a rigorous exploration of the 
relationship between the source vertex degree and its sur- 
rounding local community structure as a topic for future 
work. 



VI. CONCLUSIONS 

Although many recent algorithms have appeared in the physics
literature for the inference of community structure when the
entire graph structure is known, there has been little
consideration of graphs that are either too large for even
the fastest known techniques, or that are, like the World
Wide Web, too large or too dynamic to ever be fully known.
Here, we define a measure of community structure which
depends only on the topology of some known portion of a
graph. We then give a simple, fast agglomerative algorithm
that greedily maximizes our measure as it explores the graph
one vertex at a time. When the time it takes to retrieve the
adjacencies of a vertex is small, this algorithm runs in time
$O(k^2 d)$ for general graphs when it explores k vertices and
the graph has mean degree d. For sparse graphs, i.e., when
$m \sim n$, this is simply $O(k^2)$. On the other hand, when
visiting a new vertex to retrieve its adjacencies dominates
the running time, e.g., downloading a web page on the World
Wide Web, the algorithm takes time linear in the size of the
explored subgraph, O(k). Generally, if we are interested in
making quantitative statements about local structure, that
is, when $k \ll n$, it is much more reasonable to use an
algorithm which is linear or even quadratic in k, than an
algorithm that is linear in the size of the graph n. Finally,
we note that our algorithm's simplicity will make it
especially easy to incorporate into web spider or crawler
programs for the discovery of local community structures on
the World Wide Web graph.

Using computer-generated graphs with known community
structure, we show that our algorithm extracts this structure
and that its performance compares favorably with other
community structure algorithms [17, 18] that rely on global
information. We then apply our algorithm to the large
recommender network of the online retailer Amazon.com, and
extract the local hierarchy of communities for several
qualitatively distinct items. We further show that a vertex's
degree is inversely related to the strength of its
surrounding local structure. This discovery points to the
existence of mesoscopic topological regularities that have
not been characterized previously. Finally, this algorithm
should allow researchers to characterize the structure of a
wide variety of other graphs, and we look forward to seeing
such applications.